feat: Perform tolerance-based comparison for lists and arrays #19
MariusMerkleQC wants to merge 7 commits into main
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff            @@
##              main       #19  +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           10        10
  Lines          707       743   +36
=========================================
+ Hits           707       743   +36
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
if isinstance(lhs_type, pl.List) and isinstance(rhs_type, pl.List):
assert actual.to_list() == [True, False, False]
To fix this, I already had a solution that computes the maximum list length across all nesting levels of a "data type tree". For example, for a list of lists whose inner lists are longer than the outer lists, `max_list_length` would be the inner list length. Because this adds even more complexity, I'd like to get this PR into main first and implement the fix in a follow-up PR.
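The idea of taking the maximum over all nesting levels can be illustrated in plain Python. This is only a sketch under assumptions: `max_nested_list_length` is a hypothetical helper, not code from this PR, and it walks Python values rather than a polars data type tree.

```python
from typing import Any


def max_nested_list_length(value: Any) -> int:
    """Return the longest list length found at any nesting level of `value`."""
    if not isinstance(value, list):
        # Scalars and None contribute no list length.
        return 0
    # Length at this level versus the longest list nested below it.
    return max([len(value)] + [max_nested_list_length(item) for item in value])


# Each row is a list of lists; the outer lists have 2 elements,
# but one inner list has 3, so the inner length wins.
column = [[[1.0], [2.0, 3.0, 4.0]], [[5.0, 6.0], None]]
print(max(max_nested_list_length(row) for row in column))
```

Using one maximum for every level keeps the bookkeeping simple, at the cost of unrolling shorter levels further than strictly necessary.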
Motivation
Partially addresses #8. Sequences (lists and arrays) are compared element-wise, iterating over each position in the sequence. Each element is then compared using the standard type-aware logic, so absolute/relative float tolerances and absolute temporal tolerances all apply naturally.
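As an illustration of the element-wise approach, here is a minimal plain-Python model for sequences of equal length. It is not the polars implementation from this PR: `values_match`, `sequences_match`, the default tolerances, and the treatment of `None` are assumptions made for the sketch, and `math.isclose` stands in for whatever float comparison the library actually uses.

```python
import math
from datetime import datetime, timedelta
from typing import Any, Sequence


def values_match(
    lhs: Any,
    rhs: Any,
    *,
    abs_tol: float = 1e-8,  # assumed default, for illustration only
    rel_tol: float = 1e-5,  # assumed default, for illustration only
    abs_tol_temporal: timedelta = timedelta(0),
) -> bool:
    """Type-aware comparison of two scalar values."""
    if lhs is None or rhs is None:
        # Assumption: two missing values are equal, one missing value is not.
        return lhs is None and rhs is None
    if isinstance(lhs, float) and isinstance(rhs, float):
        return math.isclose(lhs, rhs, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(lhs, datetime) and isinstance(rhs, datetime):
        return abs(lhs - rhs) <= abs_tol_temporal
    return lhs == rhs


def sequences_match(lhs: Sequence[Any], rhs: Sequence[Any], **tolerances: Any) -> bool:
    """Compare two equal-length sequences position by position."""
    return all(values_match(a, b, **tolerances) for a, b in zip(lhs, rhs))


start = datetime(2024, 1, 1)
print(sequences_match([1.0, 2.0], [1.0 + 1e-9, 2.0]))
print(sequences_match([start], [start + timedelta(seconds=1)],
                      abs_tol_temporal=timedelta(seconds=2)))
```

Because every position goes through the same scalar comparison, the float and temporal tolerances carry over to sequences without any sequence-specific tolerance logic.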
Maximum sequence length
An array's length (or shape, for multi-dimensional arrays) is known statically from its data type. When at least one of the two compared columns is an array, its length determines the number of elements to compare. When both columns are lists, the maximum list length must be computed at runtime. This is handled by the cached property `_max_list_lengths_by_column`, a dictionary mapping column names to their maximum list length, populated only for columns that are `pl.List` in both data frames. The resolved `max_list_length: int | None` is then passed to `condition_equal_columns()`.
Sequences of different lengths
Arrays have a fixed length, so comparing two arrays of different shapes can immediately return `False`. In all other cases, lengths may vary row by row, which is captured in the `has_same_length` expression. To avoid out-of-bounds errors when indexing into shorter lists, `null_on_oob=True` is used instead of raising. The final result combines `has_same_length` with `elements_match` (the element-wise comparison), so rows with mismatched lengths are marked as unequal.
Multi-dimensional sequences
Nested sequences (e.g., lists of lists or multi-dimensional arrays) are handled recursively: outer elements are extracted positionally, then compared via the same `_compare_columns` logic until primitive types are reached. When both sides are lists at an inner nesting level, no `max_list_length` is available, so the comparison falls back to direct equality without element-wise unrolling (i.e., tolerances do not apply at inner list levels).
Changes
- `_max_list_lengths_by_column: dict[str, int]`
- `_compare_sequence_columns()` to compare lists and arrays with each other
- `test_condition_equal_columns_list_array_{equal_exact -> with_tolerance}` to reflect the updated logic
- `test_condition_equal_columns_nested_list_array_with_tolerance`
- `test_condition_equal_columns_two_lists`, including empty lists, lists with `None` and `None`
- `test_condition_equal_columns_array_vs_list_length_mismatch`
- `test_condition_equal_columns_two_arrays_different_shapes`
- `test_condition_equal_columns_empty_arrays` and `test_condition_equal_columns_empty_lists`, respectively
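
The length handling that the list tests exercise can be modelled in plain Python. The sketch below is not the polars code from this PR: `get_or_none`, `element_match`, `rows_match`, and `compare_columns` are hypothetical names, the tolerance default is an assumption, and so is the choice to treat two `None` values as equal. `get_or_none` imitates out-of-bounds access that yields a null instead of raising.

```python
import math
from typing import List, Optional

Row = Optional[List[Optional[float]]]


def get_or_none(seq: List[Optional[float]], index: int) -> Optional[float]:
    """Return the element at `index`, or None when it is out of bounds."""
    return seq[index] if index < len(seq) else None


def element_match(lhs: Optional[float], rhs: Optional[float], abs_tol: float) -> bool:
    if lhs is None or rhs is None:
        # Assumption: two missing values are equal, one missing value is not.
        return lhs is None and rhs is None
    return math.isclose(lhs, rhs, rel_tol=0.0, abs_tol=abs_tol)


def rows_match(lhs: Row, rhs: Row, max_length: int, abs_tol: float = 1e-8) -> bool:
    if lhs is None or rhs is None:
        return lhs is None and rhs is None
    has_same_length = len(lhs) == len(rhs)
    # Unroll up to max_length positions; shorter rows yield None past their end.
    elements_match = all(
        element_match(get_or_none(lhs, i), get_or_none(rhs, i), abs_tol)
        for i in range(max_length)
    )
    return has_same_length and elements_match


def compare_columns(lhs_col: List[Row], rhs_col: List[Row], abs_tol: float = 1e-8) -> List[bool]:
    # For two list columns, the maximum length is only known at runtime.
    max_length = max(
        (len(row) for row in [*lhs_col, *rhs_col] if row is not None), default=0
    )
    return [rows_match(l, r, max_length, abs_tol) for l, r in zip(lhs_col, rhs_col)]


lhs = [[1.0, 2.0], [1.0, 2.0], [1.0, None], [], None]
rhs = [[1.0, 2.0 + 1e-9], [1.0, 2.5], [1.0], [], None]
print(compare_columns(lhs, rhs))
```

The third row shows why the length check is needed in addition to the element-wise result: `[1.0, None]` and `[1.0]` agree at every unrolled position once out-of-bounds access yields `None`, and only `has_same_length` marks the row as unequal.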